-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GSoC 8-9 #10
GSoC 8-9 #10
Conversation
@Yashwants19 looks great! I'll read through it in more detail tomorrow morning. |
Thank you. This sounds good. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few quick comments on the documentation
README.md
Outdated
|
||
gonum | ||
|
||
If you would like to build the R bindings, make sure that R >= 3.5 is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to bump the minimum requirement up to R 4.0.0 or R 3.6.0. There were a lot of internal changes from R 3.5 to R 3.6.
One change that might bite us on an earlier version is the data.matrix()
conversion for factor()
inside of data.frame()
. Unfortunately, R 4.0.0 is when factors were handled by this method in the desired format.
data.matrix() now converts character columns to factors and from
this to integers.
https://stat.ethz.ch/pipermail/r-announce/2020/000653.html
If the goal is to support R 3.6.0, we'll probably need to backport the function change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, I have updated the version to 4.0.0, may be further we can discussion it on main PR/issue(in mlpack/master
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aiming for R 4.0.0 is entirely reasonable. It is the current release. Supporting older release is a nice-to-have but really not that important. If you can do it "for free" or near-free, sure. Otherwise ... move on and stick with R 4.0.0,
Building the R bindings from scratch is a little more in-depth, though. For | ||
information on that, follow the instructions on the @ref build page, and be sure | ||
to specify @c -DBUILD_R_BINDINGS=ON to CMake; you may need to also set the | ||
location of the R program with @c -DR_EXECUTABLE=/path/to/R. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong, but we are already packaging the mlpack
source inside of the package in the mlpack-r
repository? I think there might be two "from source" categories here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So actually I'm not sure it's needed to have an mlpack-r
repository. Based on my understanding, an upload to CRAN happens not by, e.g., giving the location of a Github repository (like in Julia), but instead by directly uploading a .tar.gz
. This means that instead of an mlpack-r
repository, we could do one of the following:
(1) at every mlpack release, manually build with make r
, package what's in build/src/mlpack/bindings/r/
and upload to CRAN
(2) whenever a new release is tagged, fire off a Travis or Azure job that does (1) automatically
(3) same as (2) except use Jenkins
(4) same as (2) except use Github Actions
Does that work? Or will there be people who... don't want to use CRAN, but install directly from a Github URL or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we may require to do some discussion here. As per my knowledge its good do a R CMD check
on the package in the "proper" environment, before releasing/publishing the package. If we do R CMD check
using ctest
while building with mlpack
, it may be possible that for building the RcppMLPACK
R may use installed boost/cereal serialization files required to build mlpack or it may use some mlpack files which might not correctly travelled to RcppMLPACK
while cmaking. Hence I thought it may be helpful if we go for mlpack-r
with a R CMD check
in proper environment, Here Github Actions/Azure might help us with the package release after a successful build onmlpack-r
.
Or it may be possible that we may do these checks locally after every release in the proper environment and then go for a CRAN upload.
Just an idea, I am really sorry if I am missing something or making some blunder here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 thanks for the explanation @Yashwants19. In that case, if we were using the mlpack-r
approach you described, would the mlpack-r
repository just contain a clone of the mlpack
repository at the appropriate tag?
I do agree that having some kind of system set up to run R CMD check
before submitting to CRAN would be great, although it seems like CRAN will automatically run R CMD check
too---so maybe we can just depend on CRAN failing the uploaded package as our "CI check" for a new release?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so maybe we can just depend on CRAN failing
Please don't. CRAN really is just a few overworked humans. It is not a test bed. We have e.g. Rhub for that.
Consider CRAN uploads as a manual, explicit step. Think Journal submission (which is exaggerating somewhat to make the point), not "random copy of a file to Dropbox" or another web service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eek, I didn't realize that. Yes, in this case, let's not overwork the humans even more. :)
} | ||
else | ||
{ | ||
return "matrix()"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return "matrix()"; | |
return "matrix(c(...))"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done without ...
😄
inline std::string ProgramCall(const std::string& programName) | ||
{ | ||
std::ostringstream oss; | ||
oss << "R> "; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe set this as a command line prefix default variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function use to build markdown documentation for our website(For e.g. : https://www.mlpack.org/doc/mlpack-3.3.2/julia_documentation.html#adaboost ). This will be not used in our R package so I thought it may be a nice way to represent it , but I can change it as a command line prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The prompt for an R workspace is just >
, so personally I think that could be okay to use here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(It's configurable; and some of us use R>
do make it more visually distinct. But either form works.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yashwants19 sorry for the late reply on this one. But, when I said:
Maybe set this as a command line prefix default variable?
My thought was:
std::string command_prefix = "R> ";
As the prefix is listed in two places within the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@coatless I am really sorry, I misunderstood this part, I will update as suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated this in the last commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Sorry for the initial feedback not being concise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work @Yashwants19; great to see the final pieces coming together. I left some comments---hopefully they're helpful. 👍
CMakeLists.txt
Outdated
# Set minimum library version required by mlpack. | ||
set(ARMADILLO_VERSION "8.400.0") | ||
set(ENSMALLEN_VERSION "2.10.0") | ||
set(BOOST_VERSION "1.58") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this is a good place to put all the dependencies! Do you want to move this to the top of the file so it's easier to find?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have moved this as suggested.
@@ -32,7 +32,7 @@ bindings to other languages. It is meant to be a machine learning analog to | |||
LAPACK, and aims to implement a wide array of machine learning methods and | |||
functions as a "swiss army knife" for machine learning researchers. In addition | |||
to its powerful C++ interface, mlpack also provides command-line programs, | |||
Python bindings, and Julia bindings. | |||
Python bindings, Julia bindings, Go bindings and R bindings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This list is getting longer and longer... 🎉
Building the R bindings from scratch is a little more in-depth, though. For | ||
information on that, follow the instructions on the @ref build page, and be sure | ||
to specify @c -DBUILD_R_BINDINGS=ON to CMake; you may need to also set the | ||
location of the R program with @c -DR_EXECUTABLE=/path/to/R. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So actually I'm not sure it's needed to have an mlpack-r
repository. Based on my understanding, an upload to CRAN happens not by, e.g., giving the location of a Github repository (like in Julia), but instead by directly uploading a .tar.gz
. This means that instead of an mlpack-r
repository, we could do one of the following:
(1) at every mlpack release, manually build with make r
, package what's in build/src/mlpack/bindings/r/
and upload to CRAN
(2) whenever a new release is tagged, fire off a Travis or Azure job that does (1) automatically
(3) same as (2) except use Jenkins
(4) same as (2) except use Github Actions
Does that work? Or will there be people who... don't want to use CRAN, but install directly from a Github URL or something?
# Get the names of the movies for user 1. | ||
cat("Recommendations for user 1:\n") | ||
for (i in 1:10) { | ||
cat(" ", i, ":", as.character(movies[output$output[i], 3]), "\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point @coatless! @Yashwants19 I took a look through the code in this repository and in the RcppMLPACK
repository and I think that we aren't correctly handling the difference between R's 1-indexing and C++'s 0-indexing. I believe that we will need to modify, e.g., SetParamUCol()
and SetParamURow()
to subtract 1 from the values when converting to C++, and add 1 to the values when converting back to R. That is because when we represent labels in C++, we expect them to take values between 0 and (the number of classes - 1), whereas in R it seems like we would expect a user to provide data that has classes between 1 and the number of classes. Let me know if I overlooked something and the code already does that. 👍
std::is_same<T, arma::Col<size_t>>::value || | ||
std::is_same<T, arma::Mat<size_t>>::value) | ||
{ | ||
return "matrix(as.integer())"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea here is to provide the R code for the default for an empty matrix of integer type... I think that c(...)
won't work since the user would have to fill out what ...
is. Let me know if I misunderstood---I'm certainly not an expert. But I did find that matrix(as.integer())
worked and gave an empty matrix (I think) but matrix(as.integer(c(...)))
failed:
> matrix(as.integer(c(...)))
Error in matrix(as.integer(c(...))) : '...' used in an incorrect context
"'factor'. These values will be converted to numeric indices before " | ||
"being passed to mlpack, and then inside mlpack they will be properly " | ||
"treated as categorical variables, so there is no need to do one-hot " | ||
"encoding for this matrix type."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice explanation! Always great when we can advertise that one-hot encoding isn't necessary, since it's a pain (both implementationally and computationally 😄)
@@ -21,4 +21,4 @@ add_python_binding(adaboost) | |||
add_julia_binding(adaboost) | |||
add_go_binding(adaboost) | |||
add_r_binding(adaboost) | |||
add_markdown_docs(adaboost "cli;python;julia;go" "classification") | |||
add_markdown_docs(adaboost "cli;python;julia;go;r" "classification") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great to see this too. :)
@@ -24,4 +24,4 @@ add_cli_executable(range_search) | |||
#add_julia_binding(range_search) | |||
#add_go_binding(range_search) | |||
#add_r_binding(range_search) | |||
add_markdown_docs(range_search "cli;go" "geometry") | |||
add_markdown_docs(range_search "cli" "geometry") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, nice catch. 👍
1978434
to
e803307
Compare
Co-authored-by: James J Balamuta <coatless@users.noreply.github.com>
Co-authored-by: Ryan Curtin <ryan@ratml.org> Co-authored-by: James J Balamuta <coatless@users.noreply.github.com>
e803307
to
e9f6225
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few final comments. I think from my end everything pretty much looks good if you can handle these comments; we still have time to figure out the actual deployment / mlpack-r
repository, so we don't have to resolve that issue now of course. 👍
Co-authored-by: Ryan Curtin <ryan@ratml.org>
…ts19/mlpack into R-markdown-documentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what's best for the matrix(as.integer(c()))
discussion, but that's the only comment I have remaining. I don't have any opposition if you want to merge as-is now and then we can handle that part later. 👍
I just noticed that. I only fairly recently grok'ed this myself but a nicer way instead if R> matrix(double(),0,0)
<0 x 0 matrix>
R> matrix(integer(),0,0)
<0 x 0 matrix>
R> It results in the desired type: R> storage.mode(matrix(integer(),0,0))
[1] "integer"
R> storage.mode(matrix(double(),0,0))
[1] "double"
R> |
Hi @eddelbuettel, I have updated the default value as suggested, Thank you for the suggestion you are the best. :) |
You're too kind, and you're doing really good work. And @rcurtin has an awesome attention to detail and infinite patience too :) |
Okay, lets merge this. :) |
updating my master branch
Hi @coatless @eddelbuettel @rcurtin, As per the timeline this PR is regarding the automatically generation of markdown documentation.
I have almost completed the work. In the following weeks I will open a PR in
mlpack/master
for a fine review before publishing the package toCRAN
.Thank You.